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Abstract 



We study the problem of estimating the leading eigenvectors of a 
high-dimensional population covariance nratrix based on independent 
Gaussian observations. We establish a lower bound on the minimax 
risk of estimators under the I2 loss, in the joint limit as dimension 
and sample size increase to infinity, under various models of sparsity 
r^ I for the population eigenvectors. The lower bound on the risk points 

to the existence of different regimes of sparsity of the eigenvectors. 
f^ ' We also propose a new method for estimating the eigenvectors by a 

jrt ' two-stage coordinate selection scheme. 
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)P ; 1 Introduction 

• , Principal components analysis (PCA) is a widely used technique in reduc- 

^— V I ing dimensionality of multivariate data. A traditional setting where PCA is 

Psl ■ applicable involves repeated observations from a multivariate normal distri- 

bution. Two key theoretical questions are: i) what is the relation between 
the sample eigenvectors and the population ones ? and ii) how well can 
k> ■ population eigenvectors be estimated under various sparsity assumptions ? 

^ . When the dimension N of the observations is fixed and the sample size n 

increases to infinity, the asympt otic properties o f the sample eige nvalues 
and eigenvectors are well-known Anderson I . Il963l . iMuirhead I . Il982 |. Most 



of this asymptotic analysis is based on the fact that the sample covariance 
approximates well the population covariance when the sample size is large. 
However, it is increasingly common to encounter statistical problems where 
the dimensionality of the observations is of the same order of magnitude as 
(or even bigger than) the sample size. In such cases, the sample covariance 
matrix, in general, is not a reliable estimate of the population covariance 
matrix. 



To overcome this curse of dimensionality, several works studied the esti- 
mation of the population covariance matrix, under various models of spar- 

sity. These include the d eve lopment of banding and thresholding schemes 

Bickel and Levinal(2008al lb|. lEl Karoui I |2008l |. [Rothman et aTl |2009l ]. ICai and Liu 



201 1[ |. and analysis of their rate of co nver gence in the spectral norm. More 



recent works, such as lCai et al. I 201Cll | and lCai and Zhou I 201 1[ | established 



the minimax rate of convergence under the matrix li norm and the spectral 
norm, and its dependence on the assumed sparsity level. 

In contrast to these works, that studied estimation of the population 
covariance matrix, in this paper we consider a related but different problem, 
namely, the estimation of its leading eigenvectors. The interest in comparing 
these two problems is partially due to the fact that, when the population 
covariance is a low rank perturbation of the identity, which is a primary fo- 
cus of this paper, sparsity of the eigenvectors corresponding to the non-unit 
eigenvalues implies sparsity of the whole covariance. Note that consistency 
of an estiinator of the whole covariance matrix also implies convergence of 
its leading eigenvalues to their population counterparts. If the gaps be- 
tween the neighboring distinct eigenvalues remain bounded awa y from zero 



it also implies convergence of the corresponding eigen-subspaces lEl Karoui 



20081 1 . Moreover, for population eigenvalues with multiplicity one and gaps 
with neighboring eigenvalues bounded away from zero, the upper bounds 
for the whole covariance estim ation under the spec tral norm, derived in 



Bickel and Levina I J2008bl | and ICai and Zhou I |201l[, also yield an upper 



bound on the rate of convergence of the corresponding eigenvectors under 
the I2 loss. These works, however, did not study the following fundamental 
problem, considered in this paper: How well can the leading eigenvectors be 
estimated, namely, what are the minimax rates for eigenvector estimation ? 
We formulate this eigenvector estimation problem under the well-studied 
"spiked population model" which assumes that 

(*) the eigenvalues of the population covariance matrix S are 
Al + (7 , . . . , Am -I- cr , (T ,... ,a , 
for some M > 1, where ct^ > and Ai > A2 > • • • > \m > 0. 

This is a standard model in sev eral scientif i c field s, including for example 
array signal processing (e.g. see Ivan Trees I 20021 ]) where the observations 



are modeled as the sum of an M-dimensional random signal and an indepen- 
dent, isotropic noise. It also arises as a latent variab l e model for multivari- 
ate data, for example in factor analysis Jolliffe 1 . 120021 . iTipping and Bishop I . 



1998( 1 ■ The assumption that the leading M eigenvalues are distinct is made 
to simplify the analysis, as it ensures that the corresponding eigenvectors 
are identifiable up to a sign change. The assumption that all remaining 
eigenvalues are equal is not crucial as our analysis can be generalized to 
the case when these are only bounded by o"^ . Asymptotic properties of the 
eigenvalues and eigenvectors of the sample covariance matrix under this 
model, in the setting when N/n — )• c £ (0, oo) as n — >• oo, have been 
studie d bvlBaik and Silverstein I |2006l ] . iNadleTl |2008l ] . lOnatskil |2006l | and 



Paul I 20071 ]. among others. A conclusion of these studies is that when 



N/n — 7- c > 0, the eigenvectors of standard PCA are inconsistent estimators 
of the population eigenvectors. 

In analogy to the sparse covariance estimation setting, several works 
considered various models of sparsity for the leadi ng eigenvectors and de 
yelop ed im proved sparse est imators. For example IWitten and Tibshirani 



20091 1 and IZou et al. I 20061 ]. among others, imposed /i-type sparsity con- 
straints directly on the eigenve ctor estimates and prop osed optimization 



procedures for obtaining them. iShen and Huang I [20081 ] suggested a regu 



larized low rank approach to sparse PCA. The consistency of the resulting 



leading eigenvectors was recently proven in IShen et al. I 20111 ]. using a for- 
mulation of sparsity in w hich the sample size n is fixed while iV — )• oo. 



d'Aspremont et al. 



2008f ] suggested a semi-definite programming (SDP) 
proble m as a relaxation to the ^n-pe naltv for sparse S. Assuming a single 



spike, lAmini and Wainwright I |2008l ] studied the asymptotic properties of 



the le ading eigenvector of the covariance estimator obtained by ld'Aspremont et al. 
20081 ]. in th e joint limit as both sam p le siz e and dimension tend to infinity. 



Specifically, lAmini and Wainwright I [20081 ] considered a leading eigenvec- 
tor with exactly k <^ N nonzero entries all of th e form | — l/yfc, l/yfe). 



For th is hardest subproblem in the /c-sparse ^o-ball, lAmini and Wainwright I 
20081 ] first derived information theoretic lower bounds, and then, under the 
assumption that the SDP problem has a rank one solution, proved that it 
attains the optimal rate of convergence. 



In this paper, in contrast, following I Johnstone and Lu I 20091 ] we study 



the estimation of the leading eigenvectors of E assuming that these are ap- 

proxim ately sparse, with a bounded Iq norm. Under this model. I Johnstone and Lu 
2009[ ] developed an estimation procedure based on coordinate selection by 



thresholding the diagonal of the sample covariance matrix, followed by the 
spectral d ecomposition of the subm atrix corresponding to the selected co- 



ordinates. iJohnstone and Lu I [20091 ] further proved consistency of this es- 



timator assuming dimension grows at most polynomially with sample size, 
but did not study its convergence rate. Since this estimation procedure is 



considerably simpler to implement and computationally much faster than 
the li penalization procedures cited a bove , it is of interest to understand its 
theoretical properties. More recently, iMa I 201 ll ] developed a related scheme 
named ITSPCA (iterative thresholding sparse PCA) which is based on re- 
peated application of filtering, thresholding and orthogonalization steps that 
result in sparse estimators of the subspaces spanned by the leading eigen- 
vectors. He also proved consistency and derived rates of convergence of the 
proposed estimator under appropriate loss functions and sparsity assump- 
tions. 



In this paper, which is p artly based on the Ph.D. thesis iPaul I [2005l | and 
Paul and Johnstone I [20071 ] , we study the estimation of the lea ding eigen- 
vectors of S within the framework of I Johnstone and Lu I 20091 ] . but with 
an arbitrary number of spikes (i.e., M > 1) whose corresponding eigen- 
vectors all belong to appropriate l^ spaces. Our analysis thus extends the 
setting studied in Johnstone and Lu I 20091 ] and complements the work of 



Amini and Wainwright I 20081 ] that considered the /o-sparsity setting. For 



simplicity, we assume Gaussian observations in our analysis. However, up 
to multiplicative constants, the bounds on the minimax rate reported in 
this paper continue to hold under a relaxed assumption of sub-Gaussian tail 
behavior for the probability distributions of the random variables. 

The main contributions of this paper are as follows. First, we establish 
lower bounds on the rate of convergence of the minimax risk for any eigenvec- 
tor estimator under the I2 loss. This analysis points to three different regimes 
of sparsity, which we denote as dense, sparse, and ultra-sparse, each having 
a different rate of convergence. We show that in the "dense" setting (as 
defined in Section [3]), the standard PCA estimator attains the optimal rate 
of convergence, whereas in sparse settings it is not eve n consistent. Next 



we sh ow that while the diagonal thresholding scheme of I Johnstone and Lu 



2OO9I ] is consistent under these sparsity assumptions, in general, it is not 
rate optimal. This motivates us to propose a new method (Augmented 
Sparse PCA, or ASPCA) for estimating the eigenvectors that is based on 
a two-stage coor dinate selection scheme, a nd is a refinement of the thresh- 



olding scheme of I Johnstone and Lu I [2009[]. While beyond the scope of this 



paper, it is possible to show that in the ultra-spar se setting, both our AS- 
PCA procedure, as well as the method of lMa I [201 ll ] achieve the lower bound 
on the minimax risk obtained by us, and are thus rate-optimal procedures. 
There is an intermediate region where a gap exists between the current lower 
bound and the upper bound on the risk. It is an open question whether the 
lower bound can be improved in this scenario, or a better estimator can be 
derived. Table [1] provides a comparison of the lower bounds and rates of 



Estimator 


dense 


sparse 


ultra-sparse 


Lower bound 


0{N/n) 


0(n-(i-^/2)) 


0((logAf/n)i-«/2) 


PCA 


rate optimal 


inconsistent 


inconsistent 


D.T. 


inconsistent 


not rate optimal 


not rate optimal 


ASPCA 


inconsistent 


? 


rate optimal 



Table 1: Comparison of Lower Bounds on eigenvector estimation and Worst 
Case Rates of various procedures. 



convergence of various estimators. 

The theoretical results also show that under comparable scenarios, the 
optimal rate of convergence for eigenvector estimation, 0((log A^/n)"'--'^"'?/^') 
(under squared-error loss) is faster than the optimal rate for covariance esti- 
ma tion, 0((log A"/n)~(^~'^)) (u nder squared operato r norin loss), as obtained 
by jBickel and Levina I . l2008bl | and ICai and Zhou I 20 111 ] . Finally, we em- 
phasize that to obtain good finite-sample performance for both our two-stage 
scheme, as well as for other thresholding methods, the exact thresholds need 
to be carefully tuned. This issue and the detailed theoretical analysis of the 
ASPCA estimator is beyond the scope of this paper, and will be presented 
in a future public ation. Afte r this paper was complete d, we learned of 
Vu and Lei I 20121 ]. which cites iPaul and Johnstone I 2007] and contain s re- 
sults overlapping with some of the work of iPaul and Johnstone I |2007| | and 
this paper. 

The rest of the paper is organized as follows. In Section [2l we describe 
the model for the eigenvectors and analyze the risk of the standard PCA 
estimator. Li Section [3l we present the lower bounds on the minimax risk 
of any eigenvector estimator. In Section HI we derive a lower bound on the 



risk o f the diagonal thresholding estimator proposed by iJohnstone and Lu 



20091 ]. In Section[5l we propose a new estimator named ASPCA (augmented 



sparse PCA) that is a refinement of the diagonal thresholding estimator. In 
Section [6l we discuss the question of attainment of the risk bounds. Proofs 
of the results are given in Section [A] in the Appendix. 



2 Problem setup 

First we introduce certain notations. Throughout, S^~^ denotes the unit 
sphere in R centered at the origin, [xj denotes the largest integer less than 
or equal to x G R. 



Let {Xi : i = 1, . . . , n} be a triangular array, where for each n, the N xl 
random vectors Xi := X",i = l,...,n are independent and identically 
distributed on a common probability space. Throughout we assume that 
Xj's are i.i.d. as A^(0, S), where the population matrix S is a finite rank 
perturbation of (a multiple of) the identity. In other words, 

M 
S = J^A,MJ + ct2/, (1) 

where Ai > A2 > • • • > Am > 0, and the vectors 9i, . . . , 9m are orthonormal, 
which implies (*). 9^ is the eigenvector of S corresponding to the z/-th 
largest eigenvalue, namely, \y + a^. The term "finite rank" means that M 
remains fixed even as n — ?• 00. The asymptotic setting involves letting both 
n and N grow to infinity simultaneously. For simplicity, we assume that the 
Ajy's are fixed while the parameter space for the ^i^'s varies with N. 
The observations can be described in terms of the model 

M 

Xjk = y^ \fKvyiQvk + tr^ifc, i = l,...,n, /c = l,...,iV. (2) 

Here, for each n, v^i^ Zi^ are i.i.d. A^(0, 1). Since the eigenvectors of S are 
invariant to a scale change in the original observations, it is assumed that 
a = 1. Hence, Ai, . . . , Xm in the asymptotic results should be replaced by 
Ai /cr^, . . . , Aj\,//cr^ when ([T|) holds with an arbitrary cr > 0. Since the main 
focus of this paper is estimation of eigenvectors, without loss of generality 
we consider the uncentered sample covariance matrix S := -XX , where 
X=[Xi:...:Xn]. 

The following condition, termed Basic Assumption, will be used through- 
out the asymptotic analysis, and will be referred to as BA. 

BA ^ holds with a = 1; N = N{n) -^ 00 as n -^ 00; Ai > . . . > Aa/ > 
are fixed (do not vary with N), where M is unknown but fixed. 

2.1 Eigenvector estimation with squared error loss 

Given data {Xi}f^^, the goal is to estimate M and the eigenvectors 61, ... , 6m- 
For simplicity, to derive the lower bounds, we first assume that M is known. 
In Section [5.21 we derive an estimator of M, which can be shown to be con- 
sistent under the assumed sparsity conditions. To assess the performance 
of any estimator, a minimax risk analysis approach is proposed. The first 
task is to specify a loss function L{6y,9^) between the estimated and true 

6 



eigenvector. Since the model is invariant to sign changes of each Oy, we 
consider the following loss function, also invariant to sign changes. 

L(a,b) := 2(1 - |(a,b)|) =|| a.- sign{{s.,h))h f, (3) 

where a and b are A^ x 1 vectors with unit I2 norm. An estimator Oy is called 
consistent with respect to L, if L{9u, 9y) — )• in probability as n — >■ 00. 

2.2 Rate of convergence for ordinary PCA 

We first consider the asymptotic risk of the leading eigenvectors of the sam- 
ple covariance matrix (henceforth referred to as the standard PCA estima- 
tors) when the ratio N/n is small. Specifically, it is assumed that N/n — >• 
as n — )• 00. 

For future use, we define 

h{\) := ^ A > 0, (4) 

and 

In Ijohnstone and Lu I 2009f | (Theorem 1) it was shown that under a sin- 



gle spike model, as N/n — )• 0, the standard PCA estimator of the leading 
eigenvector is consistent. The following result, proven in the Appendix, is a 
refinement of that, as it also provides the leading error term. 

Theorem 1 Let Oy^pcA be the eigenvector corresponding to the v-th largest 
eigenvalue ofS. Assume that BA holds and N,n ^ 00 such that N/n — )■ 0, 
and moreover, log n = o{N) . Then, for each u = 1, . . . , M , 



sup ¥.L{e^^pcA,Ou) 



N -M 1 ^-^ 1 






nh{K) n^^g{X^,X^) 



(1 + 0(1)). (6) 



Remark 1 Observe that Theorem [7] does not assume any special structure 
(e.g., sparsity) for the eigenvectors. The first term on the RHS of ^ is 
a nonparametric component which arises from the interaction of the noise 
terms with the different coordinates, while the second term is a paramet- 
ric component which results from the interaction with the remaining M — 1 
eigenvectors corresponding to different eigenvalues. The second term shows 
that the closer the successive eigenvalues are, the larger is the estimation 



error. The upshot of (S^ is that standard PCA provides a consistent estima- 
tor of the leading eigenvectors of the population covariance matrix when the 
dimension-to-sample-size ratio (N/n) is asymptotically negligible. 



2.3 Iq constraint on eigenvectors 

As shown by various authors Nadler I . l2008l . lOnatski I . l2006l . IPaul 1 . 120071 ] , 
when N/n — )• c S (0,oo], standard PCA provides inconsistent estimators for 
the population eigenvectors. In this subsection we consider the foUowing 
model for approximate sparsity of the eigenvectors. For each z/ = 1, . . . , M, 
we assume that Oi, belongs to an Ig ball with radius C, for some q G (0, 2). 
Specifically, we assume that 9i, £ @q{C), where 



QqiC) := {a G 



jAf-l 



N 

E 

fc=l 



ak\' < Cn- 



(7) 



Note that our condition of sparsity is slightly different from that of lJohnstone and Lu 
2009l |. 



Note that since < g < 2, for @q{C) to be nonempty, one needs C > 1. 
Further, if C^ > iV^'^/^^ then the space Qq{C) is all of S^"^ because in this 
case, the least sparse vector ^t(1, 1, . . . , 1) is in the parameter space. 

The parameter space for := [^i : . . . : 6m] is denoted by 



M 



Q^{Ci,...,CM)-={eel[Qq{C,) : {9,,9,,)=0, for Uy^u'}, (8) 



u=l 



where @q{C) is defined through d?]), and C^ > 1 for all i/ = 1, . . . , M. 

Remark 2 While our focus is on eigenvector sparsity, condition (S^ also 
implies sparsity of the covariance matrix itself. In particular, for q G (0, 1), 
a spiked covariance matrix satis fying [E) also belong s to the c lass of sparse 
covar iance matrices analyzed byJBickel and Levina] '2008W. \Cai and Liu\ 
201m] and I Cai and Zhou I 1201 j /- Indeed, I Cai and Zhou I \20lA] obtained 
the minimax rate of convergence for covariance matrix estimators under 
the spectral norm when the rows of the population matrix satisfy a weak-lq 
constraint. However, as we will show below, the minimax rate for estimation 
of the leading eigenvectors is faster than that for covariance estimation. 



3 Lower bounds on the minimax risk 

We now derive lower bounds on the minimax risk of estimating 9i, under the 
loss function ([3|). To aid in describing and interpreting the lower bounds, 
we define the following two auxiliary parameters. The first is an effective 
noise level per coordinate 

tI = l/{nh{X,)) (9) 

and the second is an effective dimension 

m, := Aq{CjT,Y (10) 

where a^ := {2/9)^''^/'^, ci := log(9/8) and Ag := l/(agcf^) and C^ := 
a -I. 

The phrase effective noise level per coordinate is motivated by the risk 
bound in Theorem [U since dividing both sides of ([6]) by N , the expected 
"per coordinate" ri sk (or ya.rianc e) of the PCA estimator is asymptotically 
r^. Next, following iNadler I 2009[], let us provide a different interpretation of 



Ty. Consider a sparse 0^ and an oracle that, regardless of the observed data, 
selects a set Jr of all coordinates of 9^ that are larger than r in absolute 
value, and then performs PCA on the sample covariance restricted to these 
coordinates. Since 6^ G ©q(Cj/), the maximal squared-bias is 

N N 

sup ^ \9uk?' ^ supj^x/^ : ^a^fc < C^,maxxfc < r'',min2;A: > 0} 

which follows by the correspondence Xk = \9uk\'^, and the convexity of the 
function X^^^i^;/'^- On the other hand, by Theorem [H the maximal vari- 
ance term of this oracle estimator is of the order kT-/{nh{\y)) where kr is 
the maximal number of coordinates of 9i, exceeding r. Again, 9^ G Qq{Cu) 
implies that kj- x CyT^'^. Thus, to balance the bias and variance terms, we 
need r x 1/ ^Jnh{\y) = Ty. This heuristic analysis shows that Ty can be 
viewed as an oracle threshold for the coordinate selection scheme, i.e., the 
best possible estimator of 9y based on individual coordinate selection can 
expect to recover only those coordinates that are above the threshold Ty. 

To understand why ruy is an effective dimension, consider the least sparse 
vector 9y S Qq{Cy). This vector should have as many nonzero coordinates 
of equal size as possible. If Cy > N^^'^''^ then the vector with coordinates 



±N~^''^ does the job. Otherwise, we set the first coordinate of the vector 
to be Vl — r'^ for some r G (0, 1) and choose all the nonzero coordinates to 
be of magnitude r^. Clearly, we must have r^ = mr^, where 771 + 1 is the 
maximal number of nonzero coordinates, while the Ig constraint implies that 
(1 — r^)''' ^ + mTu < Cy. The last inequality shows that the maximal m is 
just a constant multiple of m^j. This construction also constitutes the key 
idea in the proof of Theorems [2] and El Finally, we set 

N' = ci{N -M), (11) 

where the origin of ci = log (9/8) will be explained in the proof. 

Theorem 2 Assume that BA holds, < q < 2, and n,N ^ 00. Then, 
there exists a constant Bi > such that for n sufficiently large, 

Rl := inf sup EL(^^, 0^) > Bi6n, (12) 

e^ 05(C) 

where 5n is given by 

{T^N' if T^N' < 1 and N' < m^, [dense setting] 

r'^ruy if r'^mu < 1 and lUy < N' [sparse setting] 
1 if Ty • min{A^',7Ti,y} > 1 [weak signal]. 

We may think of nin '■= uim{N' ,m^} as the effective dimension of the 
least favorable configuration. In the sparse setting, m„ = AqC^[nh{Xi^)]'^''^ < 
ciN (i.e., Cyu'^i'^ < c'N for some c' > 0), and the lower bound is of the order 

On the other hand, in the dense setting, m^ = ci{N — M). If N/n — )• c for 
some c > 0, then 5n = ci{N — M)/{nh{Xu)) x 1, and so any estimator of 
the eigenvector 9i, is inconsistent. If N/n — )■ then the lower bound is 

.„.£lH_i£)„^. (14) 

nh[\y) n 

Eq. ()14p and Theorem [1] imply that in the dense setting with N/n — )■ 0, the 
standard PCA estimator O^^pcA attains the optimal rate of convergence. 

A sharper lower bound is possible in what we call an ultra-sparse setting 
which happens if Cin'^''^ = 0(A^^~°) for some a G (0, 1). In this case the 
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dimension N is much larger than the quantity C^n''' ^ measuring the effective 
dimension. Hence, we define a modified effective noise level per-coordinate 

_2 _ a log N 



9nh{Ky 
and a modified effective dimension 

Theorem 3 Assume that BA holds, < q < 2, and n,N ^ oo such that 
i^u = 0{N^^'^) for some a G (0,1). Then, assuming that rh^f^ < 1 for n 
sufficiently large, the minimax bound (|12|) holds with 

- -2 -1 / logiV \ 1-9/2 
6n = myTy = a„ Cl[ —r-7z-^ ■ [ultra-sparse setting] (15) 

^ \nh[Xiy)/ 

Note that in the ultra-sparse setting 6n is larger by a factor of (log A^)^^^'^ 
compared to the sparse setting, Eq. ([13]). 



4 Risk of the diagonal thresholding estimator 

In this section, we analyze the convergence rate of the SPCA scheme (hence- 
forth referred to as the d iagonal thresholding or D.T. scheme) proposed by 



Johnstone and Lu I |2009l | . In this section and in Section [SI we assume for 
simplicity that N > n. Let the sample variance of the fc-th coordinate (i.e., 
the A;-th diagonal entry of S) be denoted by S^fc. Then the D.T. scheme 
consists of the following steps. 

1. Define / = /(7n) to be the set of indices k E {1, . . . ,N} such that 
Sfcjfc > 7n. for some threshold 7„ > 0. 

2. Let S// be the submatrix of S corresponding to the coordinates I. Per- 
form an eigen-analysis of S//. Denote the eigenvectors by fi, . . . , fmin{n.,|/|}- 

3. For I' = 1, . . . , M, estimate 0^, by the A^ x 1 vector f^, obtained from 
f^ by augmenting zeros to all the coordinates in I^ := {1, . . . , N} \ I. 



Assuming that 9,^ G 9q(C/). [Johnstone and Lu I 2009[ | showed that the 



D.T. scheme with a threshold of the form 7„ = 1+^y^log N/n for some 7 > 
leads to a consisten t estimator of 9,^. The ris k of this estimator, however, 
was not analyzed in [Johnstone and Lu [ 20091 ]. As we prove below, the risk 
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of the D.T. estimator is not rate optimal. This can be anticipated from 
the lower bound on the minimax risk (Theorems [2] and [3|) which indicate 
that to attain the optimal risk, a coordinate selection scheme must select all 
coordinates of 61, of size at least Cy/logN/n. With a threshold of the form 
7n above, however, only coordinates of size (log N/n)^'^ are selected. As 
shown in the following theorem, even for the case of a single signal (M = 1) 
this leads to a much larger lower bound. 

Theorem 4 Suppose that BA holds with M = 1. Let C > 0, < g < 2, 

and n,N —)■ 00 be such that C'^n'^'^ = o(m axK/n, A^j) . Then the D iagonal 
Thresholding estimator ^i__dt proposed by \ Johnstone and Lu I \200A ] satis- 
fies, for any q G (0,2), 

sup EL(^i,OT,^i) >i^9C"?n~^(^~^/2) ^ig) 

for a constant Kg > 0, where C"^ = C"^ — 1. 

Comparing ()16p with the lower bound ()13p . shows the large gap between the 
two rates, n~^''^^^~'^''^' vs. n~''^~'^''^' . The reason for this difference is that 
the D.T. scheme uses only the diagonal of the sample covariance matrix S, 
ignoring the information in its off-diagonal entries. In the next section we 
propose a refinement of the D.T. scheme, denoted ASPCA, that constructs 
an improved eigenvector estimate using all entries of S. 

5 A two stage coordinate selection scheme 

As discussed above, the DT scheme can reliably detect only those eigenvector 
coordinates \9u,k\ = 0{{logN/n)^''^), whereas to reach the lower bound one 
needs to detect those coordinates of size \6u,k\ = 0{{logN/n)^''^). 

To motivate an improved coordinate selection scheme, consider a parti- 
tion of the A^ coordinates into two sets A and B, where the former contains 
all those k such that \9ik\ is "large" (selected by the D.T. scheme), and the 
latter contains the remaining smaller coordinates. Partition the matrix S 
as 

y, ^ ^AA ^AB 
l^BA ^BB 

Observe that, T,ba = ^iGi,b()i a- ^^^ ^1 ^^ ^ "preliminary" estimator of di 
such that lim.n^oa^i{Si,A,Gi,A) > <^o) = 1 for some (5o > (e.g., 61 could be 
the D.T. estimator). Then we have the relationship, 

^BaOi,A = {Ol,A,dl,A)>^lOl,B ~ c{6o)Xi0i,B 
12 



for some c{6q) bounded below by 6o/2, say. Thus, one possible strategy 
is to additionally select all those coordinates of T,ba(^i,a that are larger 
(in absolute value) than some constant multiple of \/log N/y^nh(Xi). In 
practice we do not know T,ba or Ai but we can use S^^ as a surrogate for the 
former and the largest eigenvalue of S^^ to obtain an estimate for the latter. 
A technical challenge is to show, that with probability tending to 1, such a 
scheme indeed recovers all coordinates k with \6ik\ > c\ Vlog N j \lnh{\\ ) , 
while discarding all coordinates k with \B\-k\ < C2\/\og N / ^ nh{\i) for some 
constants ci > C2 > 0. Figure 1 provides a pictorial description of the D.T. 
and ASPCA coordinate coordinate selection schemes. 

5.1 ASPCA scheme 

Based on the ideas described above, we now present the ASPCA algorithm. 
It first makes two stages of coordinate selection, whereas the final stage 
consists of an eigen-analysis of the submatrix of S corresponding to the 
selected coordinates. The algorithm is described below. 
For any 7 > define 

/(7) = {A::Sfc, >l+7}. (17) 

Let 7i > for i = 1,2 and k > be constants to be specified later. 
Stage 1 

1° Let / = I(7i,n) where 71,^ = -fi^/log N/n. 

2" Denote the eigenvalues and eigenvectors of Sjj by £1 > ... > Imi and 
fi, . . . , frni respectively, where ttt-i = minjn, |/|}, 

3° Estimate M by M defined in Section O 
Stage 2 

4° Let E = [£7^/^fi • • • ^^^^^fjv/] and Q = S/./E. 

5" Let J = {A; / : (QQ^)A;fc > l2,n} ^^i some j2,n > 0. Define 
K = IUJ. 



Stage 3 



6° For 1/ = 1,...,M, denote by 61, the z/-th eigenvector of Skk, aug- 
mented with zeros in the coordinates K'^. 
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Remark 3 The ASPCA scheme is specified up to the choice of parameters 
7i;72,n CLiT-d n, that determine its rate of convergence. It can he shown that 
choosing 71 = 4, k = -v/2"+~e for some e > 0, and 72 « given by 




72,n = 72 A/^— + -V— (18) 



with 72 = ky^3/2 results in an asymptotically optimal rate. Again, we note 
that for finite N, n, the actual performance in terms of the risk of the result- 
ing eigenvector estimate may have a strong dependence on the threshold. In 
practice, a delicate choice of thresholds can he highly beneficial. This issue, 
as well as the analysis of the risk of the ASPCA estimator, are beyond the 
scope of this paper and will he studied in a separate publication. 

5.2 Estimation of M 

Estimation of the dimension of the signal subspace is a classical problem. 
If the signal eigenvalues are strong enough (i.e., Xi, > C\jN jn for all v = 
1, . . . , M, for some c > 1 independent of A^, n), then nonparametric methods 
that do not assum e eigenvector sparsity can asy mptotically estimate the 



correct M (see, e.g. iKritchman and Nadler I 20081 ]). When the eigenvectors 



are sparse, we can detect much weaker signals, as we describe below. 

We estimate M by thresholding the eigenvalues of the submatrix S// 
where / := I{^yJ\ogN/n) for some 7 > 0. Let fh = min{n, |I|} and Ii > 
... > £m be the nonzero eigenvalues of Sjj. Let «„ > be a user-defined 
threshold. Then, define M by 

M := max{l < A; < m : 4 > 1 + a„}. (19) 

It can be shown that under appropriate sparsity conditions, with a suitable 
choice of threshold On, M is a consistent estimator of M . 

6 Summary and Discussion 

In this paper we derived lower bounds on eigenvector estimates under three 
different sparsity regimes, denoted dense, sparse, and ultra-sparse. In the 
dense setting, Theorems [1] and [2] show that when N/n — )• 0, the standard 
PCA estimator attains the optima l rate of convergence. In the ultra-sparse 



setting. Theorem 3.1 of iMa I [201 ll | shows that the maximal risk of the IT- 



SPCA estimator proposed by him attains the same asymptotic rate as the 
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corresponding lower bound of Theorem [3l This imphes that in the ultra- 
sparse setting, the lower bound on the minimax rate is indeed sharp. In 
a separate paper, we prove that in the ultra-sparse regime, the ASPCA 
algorithm also attains the minimax rate. 

Finally, our analysis leaves some open questions in the intermediate 
sparse regime. According to Theorem [2l the lower bound in this regime 
is smaller by a factor of (log A^)^"'''^, as compared to the ultra-sparse set- 
ting. Therefore, whether there exists an estimator (and in particular, one 
with low complexity), that attains the current lower bound, or whether this 
lower bound can be improved is an open question for future research. 

A Proofs 

A.l Asymptotic risk of the standard PCA estimator 

To prove Theorem [H on the risk of the PCA estimator, we use the following 
lemmas. 

Deviation of extreme eigenvalues of Wishart matrices 

In our analysis, we shall need a probabilistic bound for deviations of || 
-ZZ — / II- This is given in the following lemma, proven in Section iBl 



Lemma A.l Let tn = 8(A^„/n) y^log Nn/Nn where Nn = max{n, A^}. Let 
Z be an N X n matrix with i.i.d. A^(0, 1) entries. Then for any c > 0, there 
exists ric > 1 such that for all n > Uc, 

P(|| -ZZ^'-In \\>- + 2\l^ + ctn] <2N-'\ (A.l) 

\ n n \ n I 

Deviation of quadratic forms 



The following lemma is due to iJohnstone I 2001 ] . 



Lemma A. 2 Let Xn denote a Chi-square random variable with n degrees of 
freedom. Then, 

l>n{l + e)) < e-3neVi6 (0 < 6 < 1), (A.2) 

l<n{l-e)) < e-"^'/^ (0 < e < 1), (A.3) 



F{xl > n(l + e)) < ^e-'^^'/^ (0 < e < 1/2, n > 16). (A.4) 
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The following lemma is from I Johnstone and Lu I 2009l | 



Lemma A. 3 Let yu,y2i,i = l,...,n be two sequences of mutually inde- 
pendent, i.i.d. N(0,1) random variables. Then for large n and any b s.t. 
< 6 < ^, 



I - Vl yiiy2i\ > \/b/n < 2 exp <^ - 



Perturbation of eigen-structure 



36 



+ 0{n''b')}. 



(A.5) 



The following lemma from iPaul I 20051 ] is convenient for risk analysis of 
estimators of eigenvectors. Several varia nts of this l emma appear in the 
literature, most based on the approach of iKato I 1980(1 . 



Lemma A. 4 Let A and B be two symmetric in x m matrices. Let the 
eigenvalues of matrix A be denoted by \i{A) > . . . > \m{A). Set Ao(^) = 
oo and Am+i(^) = — oo. For any r £ {1, . . . ,ni}, if Xr{A) is a unique 
eigenvalue of A, i.e., if \r-i{A) > \r{A) > A,.+i(^), then denoting by Pr 
the eigenvector associated with the r-th eigenvalue, 

PriA + B)- sign{pr{A + Bf^pr{A))pr{A) = -Hr{A)Bpr{A) + Rr (A.6) 

where Hr{A) := J2sy^r \ (A)-x (A) -^^'>(A) and P£^{A) denotes the projection 
matrix onto the eigenspace Eg corresponding to eigenvalue \s{A,) (possibly 
multi- dimensional) . Define A^ and A^ as 

1 



A^ 

A^ 



Hr{A)B II +\\r{A + B)- \r{A)\ \\ Hr{A) 

II B II 



m\ni<j^r<m |Aj(^) - K{A)\ 
Then, the residual term Rr can be bounded by 

II Rr II < minjlOA^, 

2Ar(l + 2A^) II Hr{A)Bpr{A) 



(A.7) 
(A.8) 



Hr{A)Bpr{A) 



+ 



_l-2A^(l + 2Ar) (l-2A^(l + 2Ar))2 
where the second bound holds only if Ar < (v5 — l)/4- 



A.9) 
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Remark A.l We can simplify the bound on the perturbation in IIA.9\} to 
show that if A^ < 1/4, then 

II Rr \\< C II HriA)Bpr{A) \\ A^ (A.IO) 

where we can take C = 30. To see this, note that \Xr{A+ B) — Xr{A)\ <\\ B \\ 
and that \\ Hr{A) ||< [miuj^r l^jiA) — Xr{A)\]~^ , so that, 

Ar- <|| Hr{A) nil B ||< A,.. 

Now, defining 5 := 2Aj.(l + 2Ar) and (3 :=|| Hr{A)B'Pr{A) \\, we have 
lOA,-, < (5/2)5^, and the bound \A.9^) may be expressed as 

R^ \\< -- — mm <^ !^ -, 1 + --— — - } . 

" "^ "- l-(5 \2 /3 ' d{l-6)] 

For X > 0, the function x i— t- min{5x/2, 1 + 1/x} < 5/2. Further, if A^ < 
1/4, then 6 < 3Ar < 3/4 and so we conclude that 

\\ Rr\\< 10/3(5 < 30/3 A,.. 

For notational simplicity, throughout this subsection, we write 9^ to 
uieanOi^^PCA- Recall that the loss function L(^^, 0j,) =|| 6y—sig\i{9y,9y)6y |p. 
Invoking Lemma lA.41 with A = Tj and i3 = S — S we get 

i - sign(^^, 9^)9^ = -H^S9^ + R^, (A.ll) 

where 

where P± = I - E^li ^f^dj^- Note that H^,9,, = and that H^W^ = 0. The 
key quantity in bounding the error term Ri^ is 

Aj, = max{(Aj, - Ai,+i)"\ (A,,_i - A^^)^^} || S - S || . 

Indeed, from (|A.10p . when A,^ < 1/4, we have, for some constant C > 0, 

II Ru ll< C II Hj,^9j, II A^. 



Set 5'^^j = CAy. We will show that as n — )• oo, 5'^^ — )■ with probability 
approaching 1 and 

II H,S9, f (1 - C)' < L{9,, 9,) <\\ H,S9, f (1 + C)'- (A.13) 
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Theorem [T] then follows from an (exact, non-asymptotic) evaluation 

nh[Xu) ri ^^ (V ~ ^i^) 

We begin with the evaluation of (|A.14p . First we derive a convenient repre- 
sentation of HySOy. In matrix form, model ([2|) becomes 

M 

X = 5^ \/V^A«^M + Z- (A.15) 

For z^ = 1, . . . , M, define 

zy = Z^Oy, wy = X.^9y = y^vy + zy. (A. 16) 

Define 



1 
(a, b)„ •= ~ X] ""^^^ ^^^ arbitrary a, b G W. (A. 17) 

Then we have 

1 ^^ 1 

Se*!, = -XWj, = y^ \/\i{Vn,Wy)n9a + -'LWu- 

n ^ V f- f* '^ n 

Using (ETTBI) . 

"^ Ao — Ay Ay 

Using (IA.12p . HyO^ = (A^ — \y)~^9^ for /U / i/, and we arrive at the desired 
representation 

HySOy = V i^Vii^g _ 1 p^z«;,. (A.18) 

^ A„ - A,, '^ nXy 

By orthogonality, 

II HySOy r= V Z"^^'^;^;, + -L^nFyZ'p^Zujy. (A.19) 

Now we compute the expectation. One verifies that Zy ~ A^(0, In) indepen- 
dently of each other and of each Vy ~ A^(0, /„), so that Wy ~ A^(0, (l-|-A,y)/n) 
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independently. Hence, for ^u / i^, 



¥.{w^,Wy)'^n = n ^Etr (wuw'^w^wj^) 



n-2tr((l + A^)(l + A,)/„) 

n-\l + \^){l + X,). (A.20) 



From ()XT6]) . 



E[w'^Z'^P^Zw^\Z] = zlZ^P^Zz^ + X^E[vlZ^P^Zv^\Z] 

= tT{zz^p^zz^e^el) + KtT{p^zz^). 

Now, it can be easily verified that if W := ZZ ~ VFAr(n, /), then for 
arbitrary symmetric N x N matrices Q, R, we have, 

EtriWQWR) = n[tT{QR) + tr(Q)tr(i?)] + nhi{QR). (A.21) 

Taking Q = P± and R = O^Oj^, by (|A:2T]1 we have 

E[wlZP^Zwy\ = ntr(Pj_) + nA^tr(Pj_) = n{N - M)(l + A^). (A.22) 

Combining ()A.20p with ()A.22p in computing the expectation of ()A.19p . we 
obtain the expression (IA.14P for E || H^S9^ |p. 

Bound for || S - S || 

We begin with the decomposition of the sample covariance matrix S. Intro- 
duce the abbreviation ^^ = n~^Zv^. Then, 

M M M 

s = E E v'vv (^A«' v)n^^^J + E ^(^M^M + ^M^M ) + ^''zz^ 

(A.23) 
and hence 

M M 

||s-s|| < E E V'^^^K^M' v)n -"^MM'! 

M 

+2Y^y%U^\\ + \\n-^ZZ^-I\\, (A.24) 
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where (5^^/ denotes the Kronecker symbol. Let Di be the intersection of all 
the events (for some constant c > 0): 



Du ■■= {\ II v^ \\l -l\ < 2c^/n-nogn, 1 < /i < M}, 
-Di2 := {\{v^,,Vu)n\<c^/n~^^ogn,l<|J,^|J,'<M}, 

Du := {II Cm II< (1 + 2cVN-Hogn)^^, 1 < ^ < M}. 

Since Vi, ~' A^(0, /„) independent of Z, we have Zv^/ \\ v^ ||~ N{0,I]\f) 
independently of v^, and || v^ |p~ Xn- Moreover, 



i^ii n {II Zv,, f I II v^ f< 1 + 2cVA^-ilogn, 1 < ^ < M} C Z^is- 
Hence, we use Lemmas IA.2I and IA.3I to prove that 

1 - P(Di) < 3Mn-^' + M(M - i)n-(3/2)c2+0("-^ log^^). (A.25) 

Define D2 to be be the event that 

D2 := \ II -ZZ^ _ /^ ||< i^ + 2W— + ct„ I , (A.26) 

\ n n \ n 



with i„ as in Lemma FA.ll with Nn = max{n, N} = nso that i„ = 8\/n ^ log n. 
Lemma lA.ll also establishes that 1 — P(-D2) ^ 211"^^ . Using the notation 
r]n '■= {N^^ logn)^/2, we have, on Di n D2, 

M M rj^ 

||S-S|| < 2c{Y,V\?r,n + 2{Y^X^){l + 2crin)\l- 



^=1 M=l 

+2W— + — + ct„. A.27 

V n n 

Recalling that py = Xy/Xi for u = 1, . . . , M, we have for large n that 

where, say Cy{p) = 2max{(/7,^ — Pi,+i)~^ ^{Pu-i — Pv)~^}- Observe that 
tn/^i = 8r]n\/N/(nXi)'^. Now, substitute ()A.27p to conclude that there are 
functions Bi{p) such that on Dn := DiCi D2, 



— N N N N 

nAi y nXf nXi V nXf 
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Our assumptions imply that 



/logra , A^ A^ N 

so that A,y — )• 0. To summarize, choose c = v2, say, so that on D„, which 
has probabihty at least 1 — 0{n~'^), we have 6'^^^ — ;■ 0. This completes the 
proof of (jATS]). 

Theorem [T] now follows from noticing that L{6y^6y) < 2 and so 

e[l(9^, e^), (Di n z?2)1 < 2Fi{Di n l>2)') = oiN-"^) = o(E II H^se^ f), 

and an additional computation using (JA.IOP which shows that 

E[|| H^Sd, f,D'^] < (E[|| H^Sd, ty/^PiDf,) = o(E[\\ H^Sd, f). 

A. 2 Lower bound on the minimax risk 

In this subsection, we prove Theorems [2] and [3j The key idea in the proofs is 
to utilize the geometry of the parameter space in order to construct appro- 
priate finite dimensional subproblems for which bounds are easier to obtain. 
We first give an overview of the general machinery used in the proof. 

Risk bounding strategy 

A key tool for deriving lower bounds on the minimax risk is Fano 's Lemma. 
In this subsection, we use superscripts on vectors 9 as indices, not exponents. 
First, we construct a large finite subset T of Q^^ {Ci, . . . , Cm), such that the 
following property holds, for a given v £ {1, . . . , M}. 

If 0^,6"^ e T, then L{9l, dl) > 46, for some 5 > (to be chosen). 

This property will be referred to as "4(5-distinguishability in Oi," ■ Given any 
estimator 6 of 0, based on data X„ = (Xi, . . . , Xn), define a new estimator 
(j)(X.n) = ^*5 whose M components are given by 9* = argminQ^jr L{9^, 6^), 
where 9^ is the z^-th column of 0. Then, by Chebyshev's inequality and the 
4(5-distinguishability in 9iy, it follows that 

sup KeL{9^,9^) > 5supP0((/)(X„) / 6»). (A.28) 

eeef(Ci,...,CM) 0GJ- 

The task is then to find an appropriate lower bound for the quantity on the 
right hand sid e of (IA.28I). For this, we use the fo llowing version of Fano's 



lemma, due to iBirge I 200l[, modifying a result of lYang and Barron I |l999l | 
(p. 1570-71). 
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Lemma A. 5 Let {Pq : 9 G 0} be a family of probability distributions on a 
common measurable space, where Q is an arbitrary parameter set. Let Pmax 
be the minimax risk over 0, with the loss function L'{9,6') = le^e'; 

Pmax = inf supPe(r ^9) = inf supEL'(e,r), 
T eee ^ 6»ee 

where T denotes an arbitrary estimator of 9 with values in 0. Then for any 
finite subset T of 0, with elements 9i, . . . ,9j where J = \T\, 

^ - Q log J ^ ^ 

where Pi = ¥g., and Q is an arbitrary probability distribution, and K{Pi,Q) 
is the Kullback-Leibler divergence of Q from P^. 

The following lemma, proven in Section (BJ gives the Kullback-Leibler 
discrepancy corresponding to two different values of the parameter. 

Lemma A. 6 Let 0^ := [9l : . . . : 9-^j^j], j = 1,2 be two parameters (i.e., for 
each j, 91, 's are orthonormal) . Let Sj denote the matrix given by (OP with 
9 = 6^ (and a = 1). Let Pj denote the joint probability distribution of n 
i.i.d. observations from A^(0,Sj). Then the Kullback-Leibler discrepancy of 
P2 with respect to Pi is given by 



ic,,,:=Kie\e') = ^ 



where 77(A) = A/(l + A). 



M M M 

V = \ V = l fJ.= l 



, (A.30) 



Geometry of the hypothesis set and Sphere Packing 

Next, we describe the construction of a large set of hypotheses J-, satisfy- 
ing the 4(5 distinguishability condition. Our construction is based on the 
well studied sphere packing problem, namely how many unit vectors can be 
packed onto S™~^, with given minimal pairwise distance between any two 
vectors. 

Here we follow the construction due to lZong I 19991 ] (p. 77). Let in h 



e a 



large positive integer, and ruQ = [2m/9\ . Define Yj^ as the maximal set of 
points of the form z = (zi, . . . , Zm) in S*"~^ such that the following is true: 

m 

^JmQZi G { — 1, 0, 1} V i, 2^, k*l — V'^o and, for z, z' G Y^, || z — z' ||> 1. 



j=i 



22 



For any m > 1, the maximal number of points lying on S™ ^ such that 
any two p oints are at d istance at least 1, is called the kissing number of an 



m-sphere. IZong I [19991 ] uses the construction described above to derive a 
lower bound on the kissing number, by showing that \Y^\ > (9/8)™^^"'"°^"^^) 
for m large. 

Next, for m < N — M we use the sets Y^ to construct our hypothesis set 
T of same size, |J-"| = \Y*^\. To this end, let {e;j}!^=i denote the standard 
basis of M''^. Our initial set 6 is composed of the first M standard basis 
vectors, ^'^ = [ei : . . . : e^f]. Then, for fixed u, and values of m,r yet to be 
determined, each of the other hypotheses 0^ ^ J- has the same vectors as 6^ 
for k ^ u. The difference is that the z/-th vector is instead given by 

m 

ei = ^/l-r^e, + rY,4^M+u j = 1, • • • , |-F|, (A.31) 

1=1 

where z-' = (zl, . . . ,Zm), j > 1, is an enumeration of the elements of Yj^. 
Thus 6l perturbs e^, in subsets of the fixed set of coordinates {M+1, . . . , M+ 
m}, according to the sphere packing construction for S'"^^. 

The construction ensures that 9l, . . . jOj^j are orthonormal for each j. 
Furthermore, ()A.30P simplifies to 

K{e^,e^) = ^nh{X,){l-{{di,e'jf) = ^nh{X,y, j = 1,...,\T\. (A.32) 

Finally, by construction, for any 6^ ,6 ^ F with j ^ k 

m,el)>r\ (A.33) 

In other words, the set J- is r^-distinguishable in 9y. Consequently, combin- 
ing (fA:28]) and (jXMI) . 

Rl = inf sup EL{e^,e^) > (rV4)[l - a{r,J^)], (A.34) 

9. e,(c) 

with 

„(.,^).i!!M^^^jl±i5iI. (A.35) 

Proof of Theorem [2] 

Let m be an integer yet to be specified and let r E (0, 1). Let Y^ be the 
sphere-packing set defined above, and let T be the corresponding set of 
hypotheses, defined via ()A.3ip . 
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Let ci = log(9/8), then we have log|J^| > bmCim, where 6m —^ 1 as 
771 —7- oo. Inserting the following value for r = r{m), 

nh[X„) 



into Eq. (|A.35P gives that 

air, F) < 



^cim + log 2 
bmCim 



Therefore, so long as rre > "m,*, an absolute constant, we have a(r, J"o) ^ 3/4. 
We need to ensure that 9i S @q{Cu). Since exactly mo coordinates are 
non-zero out of {M + 1, . . . , M + m}, 

\\el\\l = (1 - r2)'?/2 + r'^jjil^"^^ < 1 + aqr'^in^-i/^ 
where Og = (2/9)^-9/2^ a sufficient condition for ol^^ G 6g(C,,) is that 

a^mir^/my/^ < C^. (A.37) 

Substituting (IA.36P puts this into the form 



g/2 



To simultaneously ensure that (i) r^ < 1, (ii) m does not exceed the 
number of available co-ordinates, N — M, and (iii) 9i G Qq{Cu), we set 

m = min{ [nh{X,)\ ,N-M, [AgC^{nh{X,)y/^\}, 

where Ag = l/iagcf ). Recalhng the notations (l9|), (fTOj) and pT]) . this 
becomes (without loss of generality assuming nh{Xy) and ttt-i, to be integers) 

m = minjr" , N\ m,^} = r^ min{l, r^, • minjiV', m^}} 

and Theorem [2] follows. 

Proof of Theorem [3] 

The construction of the set of hypotheses in the proof of Theorem [2] consid- 
ered a fixed set of potential non-zero coordinates, namely {M -|- 1, . . . , M -|- 
m}. However, in the ultra-sparse setting, when the effective dimension is 
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significantly smaller than the nominal dimension N, it is possible to con- 
struct a much larger collection of hypotheses by allowing the set of non-zero 
coordinates to span all remaining coordinates {M -|- 1, . . . , N}. 

In the proof of Theorem [3] we shall use the following lemma, proven in 
Section El Cah Ac {1,... ,N} an m-set if \A\ = m. 

Lemma A. 7 Let k be fixed, and let Ak be the maximal collection of m— sets 
such that the intersection of any two members has cardinality at most k — 1. 
Then, necessarily, 



Let k = [?7T.o/2] -|- 1 and tuq = [13m] with < /3 < 1. Suppose that m, N ^ oo 
with m = o{N). Then 

\Ak\ > ex.p[N£{pm/2N) - 2m£{/3/2)]{l + o(l)). (A.39) 

where £{x) is the Shannon entropy function, 

£{x) = — xlog(x) — (1 — x) log(l — x), < X < 1. 

Let TT be an m— set contained in {M -|- 1, . . . , N}, and construct a family 
Jv by modifying ()A.3ip to use the set vr rather than the fixed set {M -\- 
1, . . . , M -)- m} as in Theorem [2) 



QU,n) ^ ^i_^2 e^ + rJ2^i^i, J = 1, • • • , \Y:^\. 



We will choose m below to ensure that 9^'^ G Qq[C,^). Let "P be a collection 
of sets IT such that, for any two sets vr and tt' in V, the set vr n vr' has 
cardinality at most mo/2- This ensures that the sets Jv are disjoint for 
vr 7^ vr', since each 9u is nonzero in exactly mo + 1 coordinates. This 
construction also ensures that 

Define T := Uttsp -^t- Then 

|jr| = I y j;| = |p| |y^| > |p|(9/8)'"(l+°(l)) . (A.40) 
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By Leiiima [A.7l there is a collection V such that \V\ is at least exp([iViS(?n,/9iV)- 
2m£:(l/9)](l + 0(1))). Since £{x) > -xlogx, it follows from ([XiO]) that, 

> - log — - 2£:(l/9) +log(9/8)(l + o(l))>^logAf + 0(l), 



m \\) m 

since m = 0(A^"'^~"). 

Proceeding as for Theoreni[2l we have log \J-\ > 6m(a/9)mlog A^, where 
6m — )• 1. Let us set (with m still to be specified) 

2 («/9)logJV _2 

Again, we need to ensure that 9^''^' G Qq{C^), which as before is implied 
by ()A.37p . Substituting ()A.4ip puts this into the form 

m<rhy = a~^(Cylfyf. 

To simultaneously ensure that (i) r^ < 1; (ii) m does not exceed the number 
of available co-ordinates, A^ — M; and (iii) dy € Qq{Cy\ we set 

m = min{Lf,-2j ^ ^ _ ^^ \_%\^ll^-f\'\- 

As n, A^ — 7- 00, we have that ra = Ya~^{Cy/fyY\ , and Theorem [3] follows. 

A. 3 Lower bound on the risk of the D.T. estimator 

To prove TheoremHl assume w.l.g. that (0i,dT;^i) > 0, and decompose the 
loss as 

L(ei,DT,0,) =11 e, - Oij f + II Oi^DT - 01,1 f, (A.42) 

where / = li'jn) is the set of coordinates selected by the D.T. scheme and 
9ij denotes the subvector of 61 corresponding to this set. Note that, in 
()A.42p . the first term on the right can be viewed as a bias term while the 
second term can be seen as a variance term. 

We choose a particular vector 61 = 9^ € ®q{C) so that 

E II ^* - 0*,i f> KC'in^^^^~'i/'^\ (A.43) 

This, together with ()A.42p . proves Theorem [J] since the worst case risk is 
clearly at least as large as ()A.43p . Accordingly, set r„ = C"^'^n~4(^~'J/2), 
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where C^ = C^ — 1. Since C'^n'^''^ = o(n^'^), we have r^ = o(l), and so for 
sufficiently large n, we can take r„ < 1 and define 

'yr^72 if fc = i 

e,,k =1^ if 2 < fc < m„ + 1 

^0 if mn + 2<k< N 

where 7tt,.„ = [(l/2)C''?n'^'^J. Then by construction 9^: G @q{C), since 

j; |0,,fcr = (1 - ^n)''/' + r>r''/2 < 1 + r^r.mi-'^/' < 1 + -r^ < C"?, 
fc=i 

where the last inequality is due to q £ (0, 2) and C'? = C — 1. 

For notational convenience, let a„ = jy/logN/n. Recall that D.T. se- 
lects all coordinates k for which S^^ > 1 + a^. Therefore, coordinate k is 
not selected with probability 

,.=P(S..<l+„„)=p(^<^i±;|-) (A.44) 

where Wn ~ Xn- Notice that, for /c = 2, . . . , ?tt,„ + 1, p^ = p2, and 0*^^ = 
for k > rrin + 1. Hence, 

N m„+l 

k=l k=2 

Thus, to finish the proof of Theorem [H it is enough to show that p2 > l — A^ 
for some An that converges to as Ji — )• oo. Rewrite ()A.44p as 

Pk=^{ <l + e/c = 1-IP > 1 + efe where e^ = , . .„ 'n • 

Since |6'=^,2p = r^/m„ = 2n~i/2(;^ _^ o(;^))^ j^ follows that 



/log A'" \ r-„ ^ , 



-m„ 



so that ne| ^ oo as n — )• oo. This, together with (jA.Sp . shows that p2 > 
1 — An where we can choose An = exp(— Sneg/IG) = o(l). 
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B Proof of relevant lemmas 

B.l Proof of Lemma I A. II 

We use the following result on extreme eigenvalues of Wishart matrices by 



Davidson and Szarek I [2001]. 



Lemma A. 8 Let Z be a p x q matrix of i.i.d. N(0, 1) entries with p < q. 
Let s„iax{Z) and Smin{Z) denote the largest and the smallest singular value 
of Z , respectively. Then, 

nSmax{^Z)>l + ^/^q + t) < e-«*'/2^ (A.45) 

IP(s™„(^^)<l- V^-0 < e^'^*'/'- (A.46) 

We apply Lemma IA.8I separately for N < n and for N > n. Observe 
first that, 

A :=|| -ZZ^ - In \\= max{Ai(n-iZZ^) - 1, 1 - AAr(ZZ^)}. 
n 

Consider first N < n and let s± denote the maximum and minimum singular 
values of n~^''^Zi. Define 7(t) := \l N jn + t for t > 0. Then, since A = 
max{s^ — 1, 1 — s?.}, and letting A„(t) := 27(t) + 7(i)^ we have 

{A > A„(t)} C {s+ > 1 + 7(0} U {s_ < 1 - 7(t)}. 
Now, applying Lemma lA.81 with p = N and q = n, we get 

P(A > A„(t)) < 2e-"*'/2. 
We observe that 



An{t) = {N/n + 2y^N/^) + t{2 + t + 2y^N/^). (A.47) 

Now consider N > n. Noting that X]\[{n~^ZZ'^) = 0, we have 

A = max{(iV/n)s^-l,l}. 



This time, let 7(i) := ^/n/N + t and An (t) := max{(A^/n)(l + 7(t))2- 1, 1}. 
We apply Lemma I A. 81 with p = n, q = N, so that 

>A^(t))=PGs+>l + 7(t))<e-"*'/2, 
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and observe that 



A7v(t) = {N/n + 2^/N/^) + (iV/n)t(2 + t + 2^/^). (A.48) 

Thus from (|A.47p and ()A.48p . we have 



An,ax{n,7V}(t) < (N/n + 20V/^) + t(iV„/n)(4 + t). 

Now choose t = Cy/2log Nn/Nn so that tail probabihty is at most 2e~ "* ' ^ = 
2N~'^ . The result is now proved, since if Cy/logn/n < 1 then t(A^„,/n)(4 + 

t) < Ctn- 

B.2 Proof of Lemma IA.6I 

Recall that, if distributions Fi and F2 have density functions /i and /2, 
respectively, such that the support of /i is contained in the support of /2, 
then the Kullback-Leibler discrepancy of F2 with respect to Fi , to be denoted 
by K{Fi,F2), is given by 

K{Fi,F2) = llog^My)dy. (A.49) 

For n i.i.d. observations Xi,i = 1, . . . ,n, the Kullback-Leibler discrepancy 
is just n times the Kullback-Leibler discrepancy for a single observation. 
Therefore, without loss of generality we take n = 1. Since 

M 

^~' = {I-Y,v{K)OuOl), (A.50) 

the log-likelihood function for a single observation is given by 
log fix\e) = -^log(2^)-ilog|S|-ix^S-ix 



— ^lo; 



1 ^' 



2 

u=l 



l({x,x)-f2rii^u){x,euA. (A.51) 
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From ()A.51|) . we have 

^1,2 

= Ef,^ {log f{X\0') -log f{x\e')) 
1 ^^ 

= -Y,v{Kme^{{x,el))'-E,.{{x,el))' 



1 ^^ 



v=l 



M 

2 

M 

2 



1 ^^ 



!/ = l 



3l Il2 






which equals the RHS of ()A.30p . since the columns of 6^ are orthonormal 
for each j = 1, 2. 

B.3 Proof of Lemma IA.7I 

Let Vm be the collection of all m— sets of {1, . . . ,N}, clearly \Vm\ = („)■ 
For any tti— set A, let I{A) denote the collection of "inadmissible" m— sets 
A' for which |A n ^'| > k. Clearly 

i-(-)i<r)(::: 

If Ak is maximal, then Vm = ^AeAk-^i^)^ &iid so ()A.38P follows from the 
inequality 

\Vm\ <\Ak\ max|X(A)|, 

and rearrangement of factorials. 

Turning to the second part, we recall that Stirling's formula shows that 

if k and N — t- oo, 



k) \2Trk{N-k), 



.=xp{iV£(i)}, 



where ^ G (1—(6A;) ^,1+(12A^) ^). The coefficient multiplying the exponent 



/■N\ I lm\'i 



/(T)' 



m (^ ^ ) / ( , J IS 



V2^(l - kjNY^I'^iX - k/m) ~ v^7r/3m(l - /3/2) ^ oo 
under our assumptions, and this yields ()A.39p . 
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Figure 1: Schematic diagram of the D.T. and ASPCA thresholding schemes 
under the single component setting. The vertical lines depict the absolute 
values of the coordinates of the first eigenvector. The threshold for the 
D.T. scheme is 7(log A^/n)^'^ while the thresholds for the ASPCA scheme 
is 7(log A^/n)^'^. The schemes select the coordinates above the upper limits 
(indicated by the multiplier 7_|_) and discard the coordinates below the lower 
limits (indicated by multiplier 7_) with high probability. Here, 7+ > 7 > 
7_ > are generic constants. 
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